Sprint 0 - Presentation & Data Cleaning¶

The LendingClub - Loan Prediction Dataset¶

Viewing recommendations¶

For best experience open the pre-rendered .html file

You have 2 options depending on how much you trust my code and how little time you want to dedicate to this review :)

  1. ✅ open the prerendered HTML file, which contains all of the data without any of the rendering limitations that github suffers from
    • Best if you just want to review the analysis and conclusions
  2. ✅ open the notebook in jupyterlab (instead of github) and re-run all cells
    • Best if you want to verify that the code works and that the data is not fake

Why you should NOT visualize this notebook on github

Some charts/diagrams/features are not visible in github.

  • This impacts all:
    • plotly plots
    • folium maps
    • embedded images
    • and all other dynamic content
    • quarto meta-tags that improve visualization

This is standard and well-known behaviour.

  • While some workarounds are possible, there is no universal fix that resolves all the issues.
    • Plotly rendering requires installing external tools, but that doesn't fix the other issues
    • Standard workarounds (nbviewer) do not work because the github repo is private.

If you choose to run this locally, there are some prerequisites:

  • you will need python 3.9
  • you will need to install the dependencies using pip install -r requirements.txt before proceeding.

Context¶

Imagine that you are a data scientist who was just hired by the LendingClub. They want to automate their lending decisions fully, and they hired you to lead this project. Your team consists of a product manager to help you understand the business domain and a software engineer who will help you integrate your solution into their product. During the initial investigations, you've found that there was a similar initiative in the past, and luckily for you, they have left a somewhat clean dataset of LendingClub's loan data.

In the first meeting with your team, you all have decided to use this dataset because it will allow you to skip months of work building a dataset from scratch. In addition, you have decided to tackle this problem iteratively so that you can test your hypothesis that you can automate these decisions and get actual feedback from the users as soon as possible.

For that, you have proposed a three-step plan on how to approach this problem:

  • The first step of your plan is to create a machine learning model to classify loans into accepted/rejected so that you can start learning if you have enough data to solve this simple problem adequately.
  • The second step is to predict the grade for the loan,
  • The third step is to predict the subgrade and the interest rate.

Your team likes the plan, especially because after every step, you'll have a fully-working deployed model that your company can use. Excitedly you get to work!

Imports and initial setup¶

In [92]:
import ipywidgets as widgets
from IPython.display import display, Markdown, Image, clear_output, HTML
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from scipy.stats import chi2_contingency
import missingno as msno
import pygwalker as pyg
import pygwalker_utils.config as pyg_conf

import statsmodels.api as sm

from random import random, seed
import sqlite3 as lite
import logging
import warnings
import ydata_profiling
import iplantuml
import xml.dom.minidom

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    recall_score,
    precision_score,
)
from sklearn.preprocessing import MinMaxScaler, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import plot_tree
from sklearn.metrics import make_scorer, confusion_matrix, PrecisionRecallDisplay
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.base import clone as clone_pipeline

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
import shap

from imblearn.under_sampling import RandomUnderSampler
import joblib

import dask.dataframe as dd
from dask_ml.model_selection import train_test_split as dask_train_test_split

import os
from os import path

from utils import *
from utils import __

from autofeat import AutoFeatRegressor as AFR
from sklearn.model_selection import train_test_split as sklearn_train_test_split
from sklearn.exceptions import DataConversionWarning

import category_encoders as ce

seed(100)
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 50
util.check("done")
✅
In [93]:
%reload_ext loans_clean
%reload_ext loans_utils

import loans_clean
from loans_utils import *

Let's use black to auto-format all our cells so they adhere to PEP8

In [94]:
import lab_black

%reload_ext lab_black
util.patch_nb_black()
# fmt: off
# fmt: on
In [95]:
from sklearn import set_config

set_config(transform_output="pandas")
In [96]:
logger = util.configure_logging(jupyterlab_level=logging.WARN, file_level=logging.DEBUG)

warnings.filterwarnings("ignore", category=FutureWarning)

# import warnings
# warnings.filterwarnings('error', category=pd.errors.DtypeWarning)
In [97]:
def ding(title="Ding!", message="Task completed"):
    """
    this method only works on linux
    """
    for i in range(2):
        !notify-send '{title}' '{message}'

Feature Toggles¶

Let's also create a simple feature toggle that we can use to skip expensive operations during notebook work (to save myself some time!)

Set it to true if you want to run absolutely everything. Set to false to skip optional steps/exploratory work.

In [98]:
def run_entire_notebook():
    run_all = False
    if not run_all:
        print("skipping optional operation")
    return run_all
In [99]:
def perform_split_once_off():
    split_files = False
    if not split_files:
        print("skipping once-off splitting operation")
    return split_files


A first peek at the data¶

Fetching and loading the dataset¶

The dataset from Turing College does not include a link to the kaggle page for the dataset. I initially thought that was because it is a custom, self-hosted dataset that is not available on kaggle.

The thing is, it would be really nice to find it on Kaggle, as the Kaggle page normally contains valuable information about the dataset: provenance, insights, a data dictionary, etc.

So a quick google search showed that the dataset might be the one hosted in here: https://www.kaggle.com/datasets/wordsforthewise/lending-club

We downloaded both and compared the file contents, to make sure they are in fact the same.

Now that we know for sure that this kaggle dataset is the same one that was given to us by Turing, we're ready to start!

Downloading from Kaggle¶

We're ready to start! Let's download the dataset from Kaggle.

In [102]:
dataset_name = "wordsforthewise/lending-club"

db_filename = "accepted_raw.csv"

auto_kaggle.download_dataset(dataset_name, db_filename, timeout_seconds=120)
__
File [dataset/accepted_raw.csv] already exists locally!
No need to re-download dataset [wordsforthewise/lending-club]

This dataset has an unusual structure. There seem to be 4 items.

We could assume that the uncompressed elements are the only needed ones, and that the ZIP files are redundant... but they were all inside the main download, so we want to check them to see:

  • whether they are the same or different
  • if they are different, what the differences are

Comparing .zip contents¶

Are they the same?

(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ mv rejected_2007_to_2018q4.csv original_rejected.csv
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ mv accepted_2007_to_2018q4.csv original_accepted.csv
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ mv accepted_2007_to_2018Q4.csv.gz zip_accepted.csv.gz
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ mv rejected_2007_to_2018Q4.csv.gz zip_rejected.csv.gz
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ gzip --decompress zip_accepted.csv.gz
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ gzip --decompress zip_rejected.csv.gz
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ sha256sum zip_accepted.csv original_accepted.csv/accepted_2007_to_2018Q4.csv 
3eae03c28fd9d2e8a076ebeb73507e8d4d0f44d90500decdb0936e0933d1f36a  zip_accepted.csv
3eae03c28fd9d2e8a076ebeb73507e8d4d0f44d90500decdb0936e0933d1f36a  original_accepted.csv/accepted_2007_to_2018Q4.csv
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ sha256sum zip_rejected.csv original_rejected.csv/rejected_2007_to_2018Q4.csv 
07eb8468d55340d8ca4145c3e3c2e2d3e25ff83c44e432a825729ee6c99c4d45  zip_rejected.csv
07eb8468d55340d8ca4145c3e3c2e2d3e25ff83c44e432a825729ee6c99c4d45  original_rejected.csv/rejected_2007_to_2018Q4.csv
(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ 

Turns out, they are the exact same files ✅

The downloaded dataset has duplicated files, and we can just keep one copy of each (1 x accepted + 1 x rejected), and drop the rest.

Deleting duplicated data¶

In [103]:
if run_entire_notebook():
    accepted_raw = "dataset/accepted_raw.csv"
    rejected_raw = "dataset/rejected_raw.csv"

    if path.exists("dataset/accepted_2007_to_2018q4.csv/accepted_2007_to_2018Q4.csv"):
        !mv dataset/accepted_2007_to_2018q4.csv/accepted_2007_to_2018Q4.csv $accepted_raw

    if path.exists("dataset/rejected_2007_to_2018q4.csv/rejected_2007_to_2018Q4.csv"):
        !mv dataset/rejected_2007_to_2018q4.csv/rejected_2007_to_2018Q4.csv $rejected_raw

    folders_to_delete = [
        "dataset/accepted_2007_to_2018Q4.csv.gz",
        "dataset/rejected_2007_to_2018Q4.csv.gz",
        "dataset/accepted_2007_to_2018q4.csv",
        "dataset/rejected_2007_to_2018q4.csv",
    ]

    for folder in folders_to_delete:
        if path.exists(folder):
            print(f"deleting empty folder [{folder}]")
            !rm -d "$folder"
skipping optional operation

Loading the data into memory¶

Trying to load the entire file into pandas crashes the Jupyter kernel, so we will try some other options for our initial inspection.

On the positive side, it's good to know that it actually works in raw IPython, although it takes a serious toll on our system memory+swap:

(base) edu@flex:~/turing-college/projects/sprint11-loans/dataset/accepted_2007_to_2018q4.csv$ ipython
Python 3.9.13 (main, Aug 25 2022, 23:26:10) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.31.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: pd.read_csv("accepted_2007_to_2018Q4.csv", index_col=0)
<ipython-input-2-c093d3440acc>:1: DtypeWarning: Columns (0,19,49,59,118,129,130,131,134,135,136,139,145,146,147) have mixed types. Specify dtype option on import or set low_memory=False.
  pd.read_csv("accepted_2007_to_2018Q4.csv", index_col=0)
Out[2]: 
                                                  member_id  loan_amnt  funded_amnt  funded_amnt_inv  ... settlement_date  settlement_amount  settlement_percentage settlement_term
id                                                                                                    ...                                                                          
68407277                                                NaN     3600.0       3600.0           3600.0  ...             NaN                NaN                    NaN             NaN
68355089                                                NaN    24700.0      24700.0          24700.0  ...             NaN                NaN                    NaN             NaN
68341763                                                NaN    20000.0      20000.0          20000.0  ...             NaN                NaN                    NaN             NaN
66310712                                                NaN    35000.0      35000.0          35000.0  ...             NaN                NaN                    NaN             NaN
68476807                                                NaN    10400.0      10400.0          10400.0  ...             NaN                NaN                    NaN             NaN
...                                                     ...        ...          ...              ...  ...             ...                ...                    ...             ...
88985880                                                NaN    40000.0      40000.0          40000.0  ...             NaN                NaN                    NaN             NaN
88224441                                                NaN    24000.0      24000.0          24000.0  ...        Mar-2019            10000.0                  44.82             1.0
88215728                                                NaN    14000.0      14000.0          14000.0  ...             NaN                NaN                    NaN             NaN
Total amount funded in policy code 1: 1465324575        NaN        NaN          NaN              NaN  ...             NaN                NaN                    NaN             NaN
Total amount funded in policy code 2: 521953170         NaN        NaN          NaN              NaN  ...             NaN                NaN                    NaN             NaN

[2260701 rows x 150 columns]

In [3]: _

But let's try to keep our analysis in Jupyter, just for the sake of the reading experience.
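If we do end up needing the full file inside Jupyter, one memory-friendly hedge is pandas' chunked reading, which streams the CSV in pieces instead of materializing it all at once. A minimal sketch (the in-memory CSV below is a hypothetical stand-in for the real accepted file):

```python
import io

import pandas as pd

# Hypothetical stand-in for the huge accepted CSV; in the real notebook
# this would be the path to the file on disk.
csv_data = io.StringIO("loan_amnt,term\n3600,36\n24700,36\n20000,60\n")

total_rows = 0
total_amount = 0.0
# chunksize yields DataFrames of at most 2 rows each, keeping memory bounded
for chunk in pd.read_csv(csv_data, chunksize=2):
    total_rows += len(chunk)
    total_amount += chunk["loan_amnt"].sum()

print(total_rows, total_amount)  # → 3 48300.0
```

The same pattern, combined with an explicit `dtype=` mapping, would also silence the mixed-type warnings we saw above.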

While we're on the terminal, let's also take a look at the lines at the end of the file

In [104]:
if run_entire_notebook():
    !tail -n 3 "$accepted_raw"
skipping optional operation

A few things we noticed (so far):

  • The files are enormous (enough to crash Jupyter, LibreOffice, and pretty much anything else)
  • The last lines of the files are not following the rest of the structure and seem to contain summary data (we can skip this with pandas)
  • Some of the lines contain unescaped separator characters (,), which the CSV parser struggles to handle properly.

💡 We will drop any lines that have too many/few commas, caused by unescaped text (comments, etc...)

Handling raw files¶

Dropping malformed lines¶

We will output a random shuffle of the file contents to a separate file, to take a quick look at the data, without crashing our kernel.

NOTE: These are *nix/linux commands. This next cell will only work in *nix-compatible systems. That is: It's unlikely to work in Windows, might work on Mac (if you install additional packages) but it's not tested.

They create a new sampled file with 100k rows from each file. These files will be used for our initial exploration only; no model training will be done with this random data.

In [105]:
def drop_lines_with_wrong_commas(file_in: str, file_out: str, expected_commas: int):
    if path.exists(file_out):
        print(f"cleaned file already exists [{file_out}]. Skipping operation.")
        return

    print(f"dropping malformed rows from {file_in}: ", end="")
    rows_skipped = 0
    with open(file_in, "r") as inp, open(file_out, "w") as out:
        for row in inp:
            if row.count(",") == expected_commas:
                out.write(row)
            else:
                rows_skipped += 1
    print(f"rows dropped: {rows_skipped}")
In [106]:
if run_entire_notebook():
    acc_lines = !wc -l dataset/accepted_raw.csv
    rej_lines = !wc -l dataset/rejected_raw.csv
    print(acc_lines[0])
    print(rej_lines[0])
skipping optional operation
In [107]:
clean_accepted_raw = "dataset/clean_accepted_raw.csv"
clean_rejected_raw = "dataset/clean_rejected_raw.csv"
In [108]:
if run_entire_notebook():
    drop_lines_with_wrong_commas(accepted_raw, clean_accepted_raw, expected_commas=150)
    drop_lines_with_wrong_commas(rejected_raw, clean_rejected_raw, expected_commas=8)

    del accepted_raw
    del rejected_raw
skipping optional operation

Let's sample a subset of the cleaned files so we can get a general feel for the type of data inside.

In [109]:
sample_accepted = "dataset/sample_clean_accepted.csv"
sample_rejected = "dataset/sample_clean_rejected.csv"

if run_entire_notebook():
    assert clean_accepted_raw
    if not path.exists(sample_accepted):
        !head -n 1 "$clean_accepted_raw" > "$sample_accepted"
        !shuf -n 100000 "$clean_accepted_raw" >> "$sample_accepted"
        # since we are not including the `--repeat` flag, shuf will NOT output repeated lines.

    assert clean_rejected_raw
    if not path.exists(sample_rejected):
        !head -n 1 $clean_rejected_raw > $sample_rejected
        !shuf -n 100000 $clean_rejected_raw >> $sample_rejected
skipping optional operation

Just in the interest of visualizing the sizes we are dealing with, let's count the number of lines for each file:

In [110]:
if run_entire_notebook():
    for file in [
        clean_accepted_raw,
        clean_rejected_raw,
        sample_accepted,
        sample_rejected,
    ]:
        lines = !wc -l $file
        print(lines[0])
skipping optional operation
In [111]:
# delete the variables from the scope, not the files ;)
if run_entire_notebook():
    del clean_accepted_raw
    del clean_rejected_raw
skipping optional operation

Getting the data dictionary for this dataset¶

At this stage, it would be helpful to find the data dictionary for this dataset.

The original source page is gone (HTTP 403 FORBIDDEN), so we will look for the next best thing: someone's backup of the dictionary.

We can readily find a copy of the files here: https://www.kaggle.com/datasets/jonchan2003/lending-club-data-dictionary

In [112]:
data_dictionary = "jonchan2003/lending-club-data-dictionary"
data_dictionary_check = "Lending Club Data Dictionary Approved.csv"

auto_kaggle.download_dataset(
    data_dictionary,
    data_dictionary_check,
    subdir="dataset/dictionary",
    timeout_seconds=120,
)
__
File [dataset/dictionary/Lending Club Data Dictionary Approved.csv] already exists locally!
No need to re-download dataset [jonchan2003/lending-club-data-dictionary]
In [113]:
if run_entire_notebook():
    dict_files = [
        "Approved.csv",
        "Notes.csv",
        "Reject.csv",
    ]

    for file in dict_files:
        source = f"dataset/dictionary/Lending Club Data Dictionary {file}"
        target = f"dataset/dictionary/datadict_{file.lower()}"
        if not path.exists(target):
            print(f"converting to utf-8 and cleaning {file}")
            f = pd.read_csv(source, encoding="Windows-1252", index_col=0)
            f = f.dropna(axis=1, how="all")
            f.to_csv(target)
skipping optional operation
In [114]:
pd.read_csv("dataset/dictionary/datadict_approved.csv")
Out[114]:
LoanStatNew Description
0 acc_now_delinq The number of accounts on which the borrower i...
1 acc_open_past_24mths Number of trades opened in past 24 months.
2 addr_state The state provided by the borrower in the loan...
3 all_util Balance to credit limit on all trades
4 annual_inc The self-reported annual income provided by th...
... ... ...
148 settlement_amount The loan amount that the borrower has agreed t...
149 settlement_percentage The settlement amount as a percentage of the p...
150 settlement_term The number of months that the borrower will be...
151 NaN NaN
152 NaN * Employer Title replaces Employer Name for al...

153 rows × 2 columns

Inspecting data using the sanitised/shuffled dataset¶

The sample datasets we created earlier contain 100k rows.

The idea is to use these for the general exploration to get a sense of the type of data in them.

Once we have a plan, we can use the files with all the data, so we get more data to train our models.

In [115]:
accepted_df_ = pd.read_csv(sample_accepted)
rejected_df_ = pd.read_csv(sample_rejected)
Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
In [116]:
accepted_df_.head()
Out[116]:
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... hardship_payoff_balance_amount hardship_last_payment_amount disbursement_method debt_settlement_flag debt_settlement_flag_date settlement_status settlement_date settlement_amount settlement_percentage settlement_term
0 55211652 NaN 13200.0 13200.0 13200.00 36 months 11.53 435.48 B B5 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN
1 100629169 NaN 13500.0 13500.0 13500.00 36 months 11.44 444.80 B B4 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN
2 132825772 NaN 13000.0 13000.0 13000.00 36 months 11.98 431.67 B B5 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN
3 70172881 NaN 15000.0 15000.0 15000.00 36 months 22.45 576.36 E E5 ... NaN NaN DirectPay N NaN NaN NaN NaN NaN NaN
4 169516 NaN 4000.0 4000.0 690.91 36 months 14.70 138.08 E E5 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN

5 rows × 151 columns

In [117]:
rejected_df_.head()
Out[117]:
Amount Requested Application Date Loan Title Risk_Score Debt-To-Income Ratio Zip Code State Employment Length Policy Code
0 2000.0 2018-06-01 Other NaN 6% 070xx NJ < 1 year 0.0
1 10000.0 2018-09-12 Other NaN 16.91% 280xx NC < 1 year 0.0
2 3000.0 2017-09-30 Debt consolidation 697.0 39.14% 601xx IL < 1 year 0.0
3 14500.0 2018-11-07 Debt consolidation NaN 41.26% 180xx PA < 1 year 0.0
4 10000.0 2018-04-17 Debt consolidation NaN 14.79% 903xx CA < 1 year 0.0

Contrary to what it may seem, the two files (accepted and rejected) are not about whether a loan was "accepted" by a lender, but about which loan applications were accepted onto the platform (LendingClub.com).

We are only interested in the first one (the accepted loans).

We will unload unnecessary data from memory.

In [118]:
if run_entire_notebook():
    handy_utils.inspect_scope(globals())
skipping optional operation
In [119]:
if run_entire_notebook():
    del rejected_df_
    del sample_rejected
skipping optional operation
In [120]:
from IPython import get_ipython

if run_entire_notebook():
    get_ipython().run_line_magic("reset", "-f out")
skipping optional operation
In [121]:
if run_entire_notebook():
    handy_utils.inspect_scope(globals())
skipping optional operation

Nice... so much cleaner!

Exploratory Data Analysis¶

In this analysis, we will take a high-level overview of our data (from a random subset of the data).

The goal of this analysis is to get an understanding of what it looks like.

Once we are happy that we understand what the data looks like, we will split the entire dataset into train/val/test splits, we will train our models on the train split and validate/test on the other two.

Since the dataset is quite large, we might even try to use online training to improve our model with small batches, sequentially, instead of trying to train it all at once.

In [32]:
accepted_exploratory_sample = accepted_df_.sample(10000)

Using ydata-profiling to get a general idea of the data¶

ydata-profiling is an automated tool that we can use to get a general idea of our dataset without requiring much code.

It is particularly useful when we have to automate data analysis covering very large datasets (millions of rows, hundreds of columns), like the one we are working on.

We will also cache the report to disk so that it is available for future runs (saving valuable time!)

In [123]:
@cached_with_pickle()
def ydata_report_data():
    report = ydata_utils.report(
        accepted_exploratory_sample, config_file="config/ydata_custom_config.yaml"
    )
    return report


@cached_profile_report()
def ydata_report():
    return ydata_report_data()


if run_entire_notebook():
    display(ydata_report())
skipping optional operation
Loading from cache [./cached/profile_report/ydata_report.html]

Using pygwalker for rapid-charting exploratory analysis¶

Pygwalker is a Tableau-like tool that provides an easy interface to quickly generate lots of charts. Considering that we've spent the last 11 sprints using matplotlib and Seaborn, I was looking to explore additional tools that simplify EDA and remove some of the burden.

The ideal tool would have:

  • an easy workflow to efficiently explore data
  • versatility
  • ability to persist the analysis done/reload it later

Since we're not looking to create beautiful charts for presenting to stakeholders, and we only care about the insights, we will not be customizing anything in terms of theme. This is just for us ;)

Let's pick a small sample of the entire dataset to perform interactive EDA. The goal is not to get ALL of the insights, but just the most obvious ones. We don't care if we miss something; we are trying to get to useful conclusions quickly and easily.

Since pygwalker struggles with large datasets, 2000 rows should be enough for now.

Before we start using PygWalker, let's configure its privacy policy so that no user data is "accidentally" sent out from our computer. Read more about it on the official page.

In [124]:
pyg_conf.set_config({"privacy": "offline"}, save=True)

We are ready to go!

We will load the results from a previously saved session, so that the notebook renders them all instead of starting from scratch.

In [125]:
if run_entire_notebook() or True:
    pyg.walk(
        accepted_exploratory_sample,
        dark="light",
        spec="config/pygwalker/config.json",
    )
skipping optional operation

Data Analysis/Domain insights

  • It seems that 3 of the fields are pretty much identical. We will drop 2 of them and use the 3rd one.
    • We expect our model to correctly pick up on the more relevant field... However, since the ones that are different are so rare, it might be that the model does not have enough to figure it out.
    • I think I will actually just drop 2 of them myself. Being able to reduce the search space x100 (remember: curse of dimensionality) might help a bit with performance (?)
  • The vast majority of loans are individual
  • The vast majority of loans are for 36 months (not 60), and most applicants either pay a mortgage or pay rent monthly. Very few unmortgaged homeowners actually apply for loans.
  • The overwhelming majority of applicants are in the 0-95k income bracket, and the amount that most people apply for is around 4K-12K

Tech insights

  • PygWalker is good for fast charting but it is not an enjoyable experience.
    • Hard to find the fields you are looking for. It's not possible to type the field name and get autocompletion, you have to scroll around
    • Very slow once you get past a few tabs open. Browser struggles with it.
    • Not many customization options (binning brackets, how to do scatter plots without aggregates on fields, etc...)

It's a good start, but I would not recommend it for anything beyond very basic exploratory charting... and even then, LibreOffice might be better, more responsive, and richer.

Using missingno to visualize missing data¶

We can use the missingno library to easily visualize if the missing data follows any patterns.

The "accepted loans" dataset contains some NAs that we want to investigate before going any further.

In [36]:
@run
@cached_chart()
def missing_no_matrix_accepted():
    na_sample = accepted_df_.sample(1000)
    sorted_data = msno.nullity_sort(na_sample, sort="descending")
    msno.matrix(sorted_data)
    ax = plt.gca()

    return ax
Loading from cache [./cached/charts/missing_no_matrix_accepted.png]

Some things to highlight:

  1. Patterns: There seem to be a few patterns in the missing data, as multiple columns show the same spread of gaps.
  2. No data $\neq$ no information: Some of the columns that appear as NAs are likely empty to signify something specific. For example:
    • some fields, like mths_since_last_delinq, do not contain "0" values. An empty cell in such a column is significant and meaningful: it means the person never missed a payment/was never delinquent, so we should keep it.
    • Columns like these will be populated manually, when it makes sense.
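As a sketch of that manual population, the sentinel value and the indicator column below are assumptions for illustration, not final modeling decisions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"mths_since_last_delinq": [3.0, np.nan, 27.0, np.nan]})

# NaN here means "never delinquent", which is information, not noise:
# keep it via a sentinel larger than any observed value, plus a flag.
sentinel = 999
df["never_delinquent"] = df["mths_since_last_delinq"].isna().astype(int)
df["mths_since_last_delinq"] = df["mths_since_last_delinq"].fillna(sentinel)
```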

Understanding missing values¶

In [37]:
bootstrap_sample = pd.DataFrame(columns=accepted_df_.columns)

# just enough to make sure we get a representative bootstrapped sample
bootstrap_count = 50
batch_size = 1000

for i in range(bootstrap_count):
    sampled = accepted_df_.sample(batch_size)
    missing_by_col = sampled.isna().sum().to_frame().T
    bootstrap_sample = pd.concat([bootstrap_sample, missing_by_col])

We use a simple bootstrap to understand our distributions of NAs.

We're also interested in seeing if there is much variation between one sample and the next, so we will plot a bar chart with 95% CIs

In [38]:
nas_mean = bootstrap_sample.mean() / batch_size
nas_mean.sort_values(inplace=True, ascending=False)
bootstrap_sample = bootstrap_sample.T.reindex(nas_mean.index)
In [39]:
bootstrap_sample / batch_size
Out[39]:
0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
member_id 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
orig_projected_additional_accrued_interest 0.997 0.999 0.998 0.996 0.995 0.996 0.997 0.999 0.995 0.994 ... 0.994 0.999 0.998 0.998 0.993 0.997 0.999 0.996 0.997 0.997
hardship_end_date 0.996 0.994 0.997 0.994 0.994 0.993 0.997 0.999 0.995 0.992 ... 0.992 0.999 0.997 0.995 0.992 0.996 0.998 0.996 0.997 0.996
hardship_start_date 0.996 0.994 0.997 0.994 0.994 0.993 0.997 0.999 0.995 0.992 ... 0.992 0.999 0.997 0.995 0.992 0.996 0.998 0.996 0.997 0.996
hardship_type 0.996 0.994 0.997 0.994 0.994 0.993 0.997 0.999 0.995 0.992 ... 0.992 0.999 0.997 0.995 0.992 0.996 0.998 0.996 0.997 0.996
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
revol_bal 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
fico_range_high 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
fico_range_low 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
addr_state 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
id 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

151 rows × 50 columns

In [40]:
drop_threshold = 0.10


@run
@cached_chart()
def plot_nas_from_bootstrap():
    df_ = bootstrap_sample
    df = df_.copy() / batch_size
    plt.figure(figsize=(30, 3))
    chart_utils.rotate_x_labels(rotation=-90)
    plt.title("missing data per field (proportion)")
    chart_utils.line(
        y=(1 - drop_threshold),
        line_color="red",
        label_format=".2f",
    )
    return sns.barplot(data=df.T)
Loading from cache [./cached/charts/plot_nas_from_bootstrap.png]
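Applying that red-line threshold later could look like the sketch below (the toy proportions stand in for `nas_mean`; the real drop list would come from the bootstrap above):

```python
import pandas as pd

drop_threshold = 0.10  # same threshold as the red line in the chart

# toy per-column missingness proportions standing in for nas_mean
na_proportion = pd.Series(
    {"member_id": 1.0, "hardship_type": 0.95, "loan_amnt": 0.0, "dti": 0.02}
)

# columns missing more than (1 - drop_threshold) of their values are
# candidates for dropping
cols_to_drop = na_proportion[na_proportion > (1 - drop_threshold)].index.tolist()
print(cols_to_drop)  # → ['member_id', 'hardship_type']
```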

One last manual inspection¶

For a dataset of this size, the last mandatory step is to go through the file manually, inspect it, and take down some notes that will help us later.

The goal is to identify:

  • Which fields we should not be using (because of the risk of data leakage)
  • Which fields are unlikely to contribute
  • Any other patterns that we might not be expecting

The original .ods file is attached in the repo

The plan after an initial analysis¶

We have identified a few things we need to do to clean our data, and thanks to the automated tools, it didn't require much effort.

Here are some tasks we have identified so far:

  • Drop fields with too many missing values.
  • Identify the non-contributing fields (or fields that are not usable) so we can drop them
  • Encode some of the fields (properly) so they can be analyzed further
  • Use Statistical Inference to determine which fields are likely to be useful/not useful.
  • Create a target variable.
    • Our task (sprint 1) is to predict whether our system should accept or reject a loan application.
    • However, this 'accepted-into-the-platform' dataset has no 'rejected-by-the-bank' loans; it only contains loans that the bank accepted.
    • In our project, we play the role of the bank who wants to predict and learn from past mistakes.
    • This means that we will have to create our own criteria.
    • We can think about it like "with the data that we have, let's look back in time and see which ones should have been rejected in the first place".
  • Drop the fields that our analysis deems "not strongly contributing" to predicting our target.
  • Handle the "individual" and "joint" application types
    • one option is to merge them so they use the same columns, and encode the type into a binary column.
    • another option, more akin to iterative development, is to start with a single simple case (only do individual applications), and leave joint ones for later.
    • I will use the second option, for this iteration: less work, likely to result in a cleaner and simpler model.
    • If we were required to also predict for joint accounts, it would be better to create a second model just for those joint ones, instead of trying to merge this fundamentally different data to be predicted by 1 larger and more complex model. Keeping things simple pays in the long term.
    • So, we will drop joint data (for now), but the methodology to develop a model for joint data would be the exact same as for individual (after some minor adjustments to field names and encoding one column)
  • Feature engineering: we can create some interesting fields derived from the info we already have.
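The target-creation task above can be sketched in a few lines. This is a hedged illustration only: the column name `loan_status` and its labels are assumptions about the raw data, and the actual criteria used later by `FillTarget()` may be more nuanced.

```python
import pandas as pd

# Hypothetical sketch: "loan_status" and its labels are assumptions about the
# raw data; the notebook's real target criteria are defined later.
def make_target(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # 1 = the loan ended well (should have been accepted),
    # 0 = in hindsight, it should have been rejected
    out["predict_accept"] = (out["loan_status"] == "Fully Paid").astype(int)
    # drop the outcome column so the answer does not leak into the features
    return out.drop(columns=["loan_status"])

demo = pd.DataFrame({"loan_status": ["Fully Paid", "Charged Off", "Fully Paid"]})
print(make_target(demo)["predict_accept"].tolist())  # [1, 0, 1]
```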

Data Wrangling¶

The first part of the plan involves lots of cleaning, dropping useless data and structuring our dataset so it is usable.

Cleaning data¶

Dropping obvious columns that do not contribute useful info¶

Our dataset still has some columns that contain unusable info:

  • Almost always empty (less than 10% of data in them = more than 90% NAs)
  • All values are unique, without inherent scale/meaning (IDs, ...)
  • Meaningless/Unactionable data (urls, ...)
  • Too many unique values to be useful as categorical (ZIP, State, ...)
  • ⚠️ data leakage - info we will not have at the moment of prediction (lots of them!)
  • etc...
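The "almost always empty" criterion, for instance, fits in a tiny transformer. The sketch below is an assumed reimplementation of what a step like `DropColsWithLessThanXPercentData` might do, not the project's actual code.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DropMostlyEmptyCols(BaseEstimator, TransformerMixin):
    """Drop columns whose share of non-NA values is below `min_data`."""

    def __init__(self, min_data: float = 0.10):
        self.min_data = min_data

    def fit(self, X: pd.DataFrame, y=None):
        # remember which columns have enough data, so transform is deterministic
        self.keep_ = X.columns[X.notna().mean() >= self.min_data]
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return X[self.keep_]

df = pd.DataFrame({"full": [1, 2, 3, 4], "sparse": [None, None, None, 1.0]})
print(DropMostlyEmptyCols(min_data=0.5).fit_transform(df).columns.tolist())  # ['full']
```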

Let's get rid of them!

Before we do, however, we have identified that some columns are empty for a reason.

Some columns are empty, and this emptiness is actually meaningful and contains an important piece of data. For example, the field mths_since_last_delinq is empty when the customer has no delinquency. None of the loans have this value set to 0. We suspect that, when there was no delinquency, instead of being populated with 0, it was just left empty.

Since we don't want to accidentally drop these columns due to lots of NAs, we will manually populate them.

In [41]:
cols_where_na_means_0 = [
    "mths_since_last_delinq",
    "mths_since_last_record",
]

impute_zeros = ColumnTransformer(
    transformers=[
        (
            "impute_with_0",
            SimpleImputer(strategy="constant", fill_value=0),
            cols_where_na_means_0,
        )
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
)
In [42]:
p1_clean_drop_noise = Pipeline(
    [
        ("fill_zeros", impute_zeros),
        ("drop_time", loans_clean.DropTimeDependentColumns()),
        ("drop_joint", loans_clean.DropJointApplicationData()),
        ("drop_leak", loans_clean.DropLeakageColumns()),
        ("drop_string_columns", loans_clean.DropStringColumns()),
        ("drop_na_percent", loans_clean.DropColsWithLessThanXPercentData(drop_threshold)),
        ("fill_target", loans_clean.FillTarget()),
        ("drop_na_rows", loans_clean.DropRowsThatStillHaveNAs()),
        ("p", "passthrough"),
    ]
)
p1_clean_drop_noise
Out[42]:
Pipeline(steps=[('fill_zeros',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('impute_with_0',
                                                  SimpleImputer(fill_value=0,
                                                                strategy='constant'),
                                                  ['mths_since_last_delinq',
                                                   'mths_since_last_record'])],
                                   verbose_feature_names_out=False)),
                ('drop_time', DropTimeDependentColumns()),
                ('drop_joint', DropJointApplicationData()),
                ('drop_leak', DropLeakageColumns()),
                ('drop_string_columns', DropStringColumns()),
                ('drop_na_percent',
                 DropColsWithLessThanXPercentData(percent=0.1)),
                ('fill_target', FillTarget()),
                ('drop_na_rows', DropRowsThatStillHaveNAs()),
                ('p', 'passthrough')])

First pass - removing low hanging fruit¶

In this first pass we want to have a pipeline that does most of the easy work:

  • remove useless data
  • fill in target col (should we accept/reject the loan)
  • extract important columns (targets) we will use later
  • ...
In [43]:
print("before", accepted_df_.shape)
cleaned = p1_clean_drop_noise.fit_transform(accepted_df_)
print("after", cleaned.shape)
before (100000, 151)
after (71246, 67)
In [44]:
before_nas = p1_clean_drop_noise.named_steps["drop_na_rows"].before_df
msno.matrix(before_nas)
Out[44]:
<AxesSubplot: >
In [45]:
cleaned.head()
Out[45]:
mths_since_last_delinq mths_since_last_record loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... pct_tl_nvr_dlq percent_bc_gt_75 pub_rec_bankruptcies tax_liens tot_hi_cred_lim total_bal_ex_mort total_bc_limit total_il_high_credit_limit disbursement_method predict_accept
0 22.0 0.0 13200.0 13200.0 13200.0 36 months 11.53 435.48 B B5 ... 92.1 100.0 0.0 0.0 26327.0 17827.0 700.0 16227.0 Cash 1
1 52.0 0.0 13500.0 13500.0 13500.0 36 months 11.44 444.80 B B4 ... 95.7 0.0 0.0 0.0 191768.0 42953.0 10200.0 37768.0 Cash 0
3 0.0 0.0 15000.0 15000.0 15000.0 36 months 22.45 576.36 E E5 ... 100.0 0.0 0.0 0.0 87300.0 63908.0 31300.0 56000.0 DirectPay 0
5 12.0 26.0 7000.0 7000.0 7000.0 36 months 11.99 232.47 C C1 ... 84.6 100.0 0.0 1.0 189638.0 36306.0 3900.0 41940.0 Cash 1
6 0.0 0.0 5000.0 5000.0 4975.0 36 months 16.02 175.84 C C5 ... 100.0 100.0 0.0 0.0 243711.0 56111.0 8000.0 71637.0 Cash 0

5 rows × 67 columns

We can see that we are able to extract the columns we will need later for each of the sprints.

This is just a preview on the 100,000-row random sample. The real target columns will be extracted when we split the entire ~2M-row dataset (further down below).

In [46]:
del accepted_df_

Encoding categorical data to allow statistical analysis¶

The next thing is to determine which columns are statistically insignificant in terms of predicting our newly created target column.

In order to do this, we have to encode some of them (the non numerical ones), so we can use ANOVA to assess them.

Some of those will require simple numerical transformations; others will require a second round of encoding (into one-hot columns).
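As an illustration of the simple numerical transformations, here is what an encoder like `EncodeEmpLength` might look like. The mapping below is an assumption based on LendingClub's usual `emp_length` labels, not the project's actual implementation.

```python
import pandas as pd

# Assumed label set: "< 1 year", "1 year", ..., "9 years", "10+ years";
# the notebook's EncodeEmpLength may differ in details.
def encode_emp_length(s: pd.Series) -> pd.Series:
    mapping = {"< 1 year": 0, "1 year": 1, "10+ years": 10}
    mapping.update({f"{n} years": n for n in range(2, 10)})
    return s.map(mapping)

print(encode_emp_length(pd.Series(["< 1 year", "3 years", "10+ years"])).tolist())
# [0, 3, 10]
```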

In [47]:
p2_encode_categoricals = Pipeline(
    [
        ("encode_employment", loans_clean.EncodeEmpLength()),
        ("EncodeDisbursementMethod", loans_clean.EncodeDisbursementMethod()),
        ("EncodeInitialListingStatus", loans_clean.EncodeInitialListingStatus()),
        ("EncodeVerificationStatus", loans_clean.EncodeVerificationStatus()),
        ("EncodeHomeOwnership", loans_clean.EncodeHomeOwnership()),
        ("EncodePaymentPlan", loans_clean.EncodePaymentPlan()),
        ("EncodePurpose", loans_clean.EncodePurpose()),
        ("EncodeTerm", loans_clean.EncodeTerm()),
        (
            "onehot",
            ce.OneHotEncoder(
                cols=["purpose", "home_ownership"],
                return_df=True,
                use_cat_names=True,
                handle_unknown="ignore",
            ),
        ),
        ("p", "passthrough"),
    ]
)

p2_encode_categoricals
Out[47]:
Pipeline(steps=[('encode_employment', EncodeEmpLength()),
                ('EncodeDisbursementMethod', EncodeDisbursementMethod()),
                ('EncodeInitialListingStatus', EncodeInitialListingStatus()),
                ('EncodeVerificationStatus', EncodeVerificationStatus()),
                ('EncodeHomeOwnership', EncodeHomeOwnership()),
                ('EncodePaymentPlan', EncodePaymentPlan()),
                ('EncodePurpose', EncodePurpose()),
                ('EncodeTerm', EncodeTerm()),
                ('onehot',
                 OneHotEncoder(cols=['purpose', 'home_ownership'],
                               handle_unknown='ignore', use_cat_names=True)),
                ('p', 'passthrough')])
In [48]:
encoded = p2_encode_categoricals.fit_transform(cleaned)

Now that we have encoded all of these salvageable fields, we're ready for the next steps:

Inferential Statistics¶

Yes, Inferential Statistics are a requirement for this project... but they are not making an appearance here merely because of that requirement. They are appearing because they are useful, relevant and insightful! (wow!)

Instead of just checking the box with a few token t-tests and z-tests... I wanted to find ways to use Inferential Statistics that give cohesion to this story, fitting it to the dataset we are working with.

So, this is why we are:

Using ANOVA to decide which columns to drop from our dataset¶

We want to use ANOVA test to find which columns are more likely relevant when trying to predict our target, and which ones are not so (so we can drop them).

We want to calculate an approximation to the residual contribution of each column: this means that we will be a bit loose with the interpretation of the results we get for a couple of reasons:

  1. We are using an ANOVA test, which can serve as a useful approximation, but it was not originally designed for this purpose.
  2. One of ANOVA's assumptions (equal variances across groups) we believe is satisfied thanks to the amount of data and the central limit theorem, but we have not explicitly verified it.

In summary, we will not be strict with the interpretation of the results that we get and we will likely not adhere to the usual $\alpha = 0.05$ that we use in most situations.

The purpose of this test is to drop columns that are clearly not contributing to predicting our target. It's just an initial cleanup for the most obvious non-contributing attributes.

In [49]:
anova_threshold = 0.05

target_column = "predict_accept"


def anova_irrelevant_columns(data, threshold):
    results = {}

    for col in data.columns.drop(target_column):
        try:
            # one group of column values per target class
            marginal_c = [df[col] for grp, df in data.groupby(target_column)]
            results[col] = stats.f_oneway(*marginal_c)
        except Exception:
            # surface the offending column name before re-raising
            print(f'"{col}",')
            raise

    results = pd.DataFrame(results.values(), index=results.keys())
    results = results.sort_values(by="pvalue", ascending=False)
    results["keep"] = results["pvalue"] < threshold
    return results


if run_entire_notebook():
    print("ANOVA analysis of field relevance")
    field_relevancy = anova_irrelevant_columns(
        encoded.drop(columns=["grade", "sub_grade", "int_rate"]), anova_threshold
    )
    field_relevancy.loc["dti", "keep"] = True
    display(field_relevancy.head())
    display(field_relevancy.tail())
skipping optional operation

Let's visualize these terms a bit better

In [50]:
@run
@cached_chart()
def anova_pvalue_chart():
    plt.figure(figsize=(5, 18))
    chart_utils.line(x=anova_threshold, label_format=".2f")
    plt.title(
        "ANOVA test pvalue (approximate residual contrib to predicting loan outcome)"
    )
    return sns.scatterplot(
        x=field_relevancy["pvalue"], y=field_relevancy.index, hue=field_relevancy["keep"]
    )
Loading from cache [./cached/charts/anova_pvalue_chart.png]

A few things that are interesting:

FIELDS TO KEEP

  • It is very interesting to see how this model correctly identifies how capitalism is ingrained in the loan industry. In particular when it comes to the possible types of "home ownership":
    • if they have fixed monthly costs (rent or mortgage), then the model is very interested in keeping this piece of data
    • if they don't have these recurring costs, ANOVA does not care about the subtype (do they own a house? do they live under a rock?). As long as there are no recurring costs, the exact type of ownership is irrelevant.
    • So we will follow ANOVA's suggestion and drop the irrelevant fields. Even after dropping those other home-ownership types, this FACT (no monthly housing costs) will still be ingrained in the dataset (as a (0, 0) in the one-hot encoding), so the system will still be able to spot it.
  • the term field appears as the most likely to be relevant.
    • It could be that this is true...
    • or it could be that the dataset is not updated regularly and there were significantly more "resolved loans" (whether "fully paid" or "charged off"), simply because of the timeframe.
    • We will keep this field to be used in our model, of course, but we will be particularly careful not to make wild claims about its importance.
    • The same applies to all other fields.
  • the rest of the fields also seem reasonable:
    • amounts loaned, various stats on numbers of accounts they have or they recently opened, their balances, etc..
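A quick illustration of why dropping one-hot columns does not lose the "no recurring housing cost" fact: the information survives as an all-zeros row across the remaining dummies. Column values below are hypothetical; the real dataset's categories may differ.

```python
import pandas as pd

# Hypothetical home_ownership values for illustration
df_ho = pd.DataFrame({"home_ownership": ["RENT", "MORTGAGE", "OWN"]})
onehot = pd.get_dummies(df_ho["home_ownership"], prefix="home_ownership")

# keep only the dummies ANOVA considered relevant
kept = onehot[["home_ownership_RENT", "home_ownership_MORTGAGE"]]

# the "OWN" row becomes (0, 0): both kept dummies are zero, so the model can
# still tell "no rent, no mortgage" apart from the other cases
print(kept.to_string())
```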

FIELDS TO DROP

  • the "will not keep" fields also seem reasonable (to be low contributing to predicting loan outcome):
    • employment length, purpose, ...
    • the one that is interesting (because I expected would be really relevant) is DTI.
    • DTI is described in the data dictionary as:

      DTI: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.

    • For some reason, I imagined that this type of "debt-to-income" ratio would be very important to make our predictions, but ANOVA disagrees.
    • After much thinking, I have decided that removing this field just because ANOVA suggests it is not the best course of action... I appreciate ANOVA's interest in helping, but I will disagree with its suggestion in this very specific case and leave the field in, because I suspect it is one of the most valuable ones.
      • If I am wrong and ANOVA is right: I will have wasted a tiny bit of CPU working with one extra column
      • If I am right and ANOVA is wrong: I will have improved my model significantly by providing it a useful feature, potentially even reducing training time.

If we had more time, it would be nice to build 2 models, with and without DTI, and compare their performance. But the likelihood that this would be worth the time and effort (given the time constraints and project deadlines) is vanishingly close to 0.

Removing irrelevant columns¶

We did a once-off ANOVA test to determine which columns are irrelevant. The output of ANOVA will be hard-coded as a parameter to the custom ColumnDropper() transformer that we are going to use:
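A transformer of this kind can be very small. Here is an assumed sketch of what a `ColumnDropper()` might look like, not the project's actual code:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnDropper(BaseEstimator, TransformerMixin):
    """Drop a hard-coded list of columns, ignoring any that are absent."""

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # only drop columns that actually exist in this frame
        return X.drop(columns=[c for c in self.cols if c in X.columns])

df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
print(ColumnDropper(["b", "missing"]).fit_transform(df).columns.tolist())  # ['a', 'c']
```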

In [51]:
if run_entire_notebook():
    anova_irrelevant_cols = field_relevancy[~field_relevancy["keep"]].index.to_list()
    anova_irrelevant_cols
skipping optional operation

Generalizing this cleaning step¶

This analysis seems interesting and potentially reusable. We will create a generic Transformer that can do this for any dataset.

The idea is to remove columns that are clearly not helping to predict any of the target columns. If a column is useful for predicting at least one of the targets, we will keep it. We will set a strict threshold so that our filtering is very picky.

During sprint 4 (predicting interest), we will drop this step, as it only makes sense for categorical labels, not for continuous ones.
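The keep-if-it-helps-any-target logic could look something like the sketch below. This is an assumed reimplementation of what `clean.AnovaColumnDropper` might do internally, not the actual code.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin

class AnovaColumnDropperSketch(BaseEstimator, TransformerMixin):
    """Drop features whose ANOVA p-value exceeds `threshold` for every
    target column -- i.e. keep a feature if it helps at least one target."""

    def __init__(self, cols_to_predict, threshold=0.0001):
        self.cols_to_predict = cols_to_predict
        self.threshold = threshold

    def fit(self, X: pd.DataFrame, y=None):
        self.irrelevant_ = set()
        for col in X.columns.difference(self.cols_to_predict):
            helps_some_target = any(
                stats.f_oneway(
                    *[g[col] for _, g in X.groupby(target)]
                ).pvalue < self.threshold
                for target in self.cols_to_predict
            )
            if not helps_some_target:
                self.irrelevant_.add(col)
        return self

    def transform(self, X):
        return X.drop(columns=list(self.irrelevant_))

# deterministic toy data: "signal" separates the target groups, "noise" does not
rng = np.random.default_rng(0)
t = np.repeat([0, 1], 50)
toy = pd.DataFrame({"t": t,
                    "signal": t * 10.0 + rng.normal(size=100),
                    "noise": rng.normal(size=100)})
dropper = AnovaColumnDropperSketch(cols_to_predict=["t"]).fit(toy)
print(sorted(dropper.irrelevant_))  # only the pure-noise column is flagged
```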

In [52]:
new_anova_dropper = clean.AnovaColumnDropper(
    cols_to_predict=["predict_accept", "grade", "sub_grade"],
    threshold=0.0001,
)

if run_entire_notebook():
    new_anova_dropper.fit_transform(encoded).head()
skipping optional operation
In [53]:
p3_remove_anova_cols = Pipeline(
    steps=[
        (
            "drop_ANOVA_cols",
            clean.AnovaColumnDropper(
                cols_to_predict=["predict_accept", "grade", "sub_grade"],
                threshold=0.0001,
            ),
        ),
        ("passthrough", "passthrough"),
    ]
)
p3_remove_anova_cols
Out[53]:
Pipeline(steps=[('drop_ANOVA_cols',
                 AnovaColumnDropper(cols_to_predict=['predict_accept', 'grade',
                                                     'sub_grade'],
                                    threshold=0.0001)),
                ('passthrough', 'passthrough')])
In [54]:
encoded
Out[54]:
mths_since_last_delinq mths_since_last_record loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... pct_tl_nvr_dlq percent_bc_gt_75 pub_rec_bankruptcies tax_liens tot_hi_cred_lim total_bal_ex_mort total_bc_limit total_il_high_credit_limit disbursement_method predict_accept
0 22.0 0.0 13200.0 13200.0 13200.0 0 11.53 435.48 B B5 ... 92.1 100.0 0.0 0.0 26327.0 17827.0 700.0 16227.0 1 1
1 52.0 0.0 13500.0 13500.0 13500.0 0 11.44 444.80 B B4 ... 95.7 0.0 0.0 0.0 191768.0 42953.0 10200.0 37768.0 1 0
3 0.0 0.0 15000.0 15000.0 15000.0 0 22.45 576.36 E E5 ... 100.0 0.0 0.0 0.0 87300.0 63908.0 31300.0 56000.0 2 0
5 12.0 26.0 7000.0 7000.0 7000.0 0 11.99 232.47 C C1 ... 84.6 100.0 0.0 1.0 189638.0 36306.0 3900.0 41940.0 1 1
6 0.0 0.0 5000.0 5000.0 4975.0 0 16.02 175.84 C C5 ... 100.0 100.0 0.0 0.0 243711.0 56111.0 8000.0 71637.0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
99993 0.0 0.0 25000.0 25000.0 25000.0 1 10.49 537.23 B B3 ... 100.0 0.0 0.0 0.0 451364.0 49447.0 45900.0 54164.0 1 1
99994 0.0 0.0 20400.0 20400.0 20400.0 0 12.74 684.81 C C1 ... 100.0 60.0 0.0 0.0 151133.0 110387.0 47300.0 89533.0 1 1
99995 0.0 0.0 3000.0 3000.0 3000.0 0 13.99 102.52 C C3 ... 100.0 50.0 0.0 0.0 92324.0 75868.0 28000.0 63324.0 1 0
99998 0.0 0.0 5600.0 5600.0 5600.0 0 9.17 178.53 B B2 ... 100.0 0.0 0.0 0.0 174445.0 48390.0 15400.0 57193.0 1 1
99999 9.0 0.0 1000.0 1000.0 1000.0 0 20.39 37.37 D D4 ... 88.9 0.0 0.0 0.0 29750.0 33425.0 3000.0 25750.0 1 0

71246 rows × 74 columns

In [55]:
after_anova_clean = p3_remove_anova_cols.fit_transform(encoded)
after_anova_clean.head()
doing anova for predict_accept
doing anova for grade
doing anova for sub_grade
Out[55]:
mths_since_last_delinq mths_since_last_record loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... pct_tl_nvr_dlq percent_bc_gt_75 pub_rec_bankruptcies tax_liens tot_hi_cred_lim total_bal_ex_mort total_bc_limit total_il_high_credit_limit disbursement_method predict_accept
0 22.0 0.0 13200.0 13200.0 13200.0 0 11.53 435.48 B B5 ... 92.1 100.0 0.0 0.0 26327.0 17827.0 700.0 16227.0 1 1
1 52.0 0.0 13500.0 13500.0 13500.0 0 11.44 444.80 B B4 ... 95.7 0.0 0.0 0.0 191768.0 42953.0 10200.0 37768.0 1 0
3 0.0 0.0 15000.0 15000.0 15000.0 0 22.45 576.36 E E5 ... 100.0 0.0 0.0 0.0 87300.0 63908.0 31300.0 56000.0 2 0
5 12.0 26.0 7000.0 7000.0 7000.0 0 11.99 232.47 C C1 ... 84.6 100.0 0.0 1.0 189638.0 36306.0 3900.0 41940.0 1 1
6 0.0 0.0 5000.0 5000.0 4975.0 0 16.02 175.84 C C5 ... 100.0 100.0 0.0 0.0 243711.0 56111.0 8000.0 71637.0 1 0

5 rows × 67 columns

In [56]:
print(encoded.shape)
print(after_anova_clean.shape)
print(p3_remove_anova_cols.named_steps["drop_ANOVA_cols"].irrelevant_)
(71246, 74)
(71246, 67)
{'emp_length', 'num_tl_120dpd_2m', 'home_ownership_other', 'delinq_amnt', 'home_ownership_own', 'pymnt_plan', 'purpose_pleasure'}

Unsupervised data cleaning¶

There are a couple of things we still want to do before we jump into creating predicting models:

  1. Highly correlated features (to avoid multicollinearity issues)
  2. Outlier detection (and removal)

Detecting and Removing highly-correlated features¶

A simple heatmap will show us some obvious problems, but we don't want to do this removal manually; we want to build a simple and reusable transformer that can do it automatically (for this and future datasets).
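A sketch of such a transformer follows: scan the upper triangle of the absolute correlation matrix and drop the later column of any pair above the threshold. This mirrors the idea behind `clean.DropCorrelatedCols` but is an assumed reimplementation, not the project's code.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DropCorrelatedColsSketch(BaseEstimator, TransformerMixin):
    """Drop the later of any pair of columns whose absolute Pearson
    correlation exceeds `threshold` (upper-triangle scan)."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold

    def fit(self, X: pd.DataFrame, y=None):
        corr = X.corr().abs()
        # keep only the upper triangle so each pair is considered once
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        self.dropped_features = [
            c for c in upper.columns if (upper[c] > self.threshold).any()
        ]
        return self

    def transform(self, X):
        return X.drop(columns=self.dropped_features)

# "b" is exactly 2*"a", so it should be dropped; "c" is uncorrelated
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})
print(DropCorrelatedColsSketch(threshold=0.9).fit_transform(df).columns.tolist())
# ['a', 'c']
```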

In [58]:
@run
@cached_chart()
def corr_heatmap():
    chart_utils.canvas(15, 13)
    return sns.heatmap(after_anova_clean.corr(), cmap="coolwarm")
Out[58]:
<AxesSubplot: >

Detecting and Removing Outliers¶

Since this dataset is enormous, and has so many features, it will be hard to do outlier detection manually/visually. We will need some tooling to perform the detection automatically.

Let's drop the ~5% most obvious outliers (using a threshold of 1.95 standard deviations, similar to a standard score).

This custom transformer delegates to scikit-lego's GMMOutlierDetector(), which -in turn- uses a classifier that relies on GaussianMixtureModels.

Without being able to visualize what the dataset looks like before and after, it's hard to assess whether 5% was the right threshold... However, we do know (Pareto principle) that removing just the worst offenders, even if few, can substantially ease the work our model needs to do to learn effectively. Increasing the threshold risks removing more and more genuinely useful data points.

Before detecting and removing outliers, however, we should remove some columns that will not be available at prediction time. Remember, we are just exploring the different building blocks that we will use later in our end-to-end pipeline.
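To make the "stddev" idea concrete: the notebook's transformer delegates to scikit-lego's `GMMOutlierDetector`, but the sketch below only mirrors the concept using sklearn's `GaussianMixture` directly, and is not the project's actual code (a single mixture component keeps the toy example deterministic).

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

def drop_gmm_outliers(df: pd.DataFrame, threshold: float = 2.0) -> pd.DataFrame:
    """Keep rows whose GMM log-likelihood is within `threshold` standard
    deviations of the mean log-likelihood (the "stddev" idea)."""
    gmm = GaussianMixture(n_components=1, random_state=0).fit(df)
    log_lik = gmm.score_samples(df)  # per-row log-likelihood under the fit
    cutoff = log_lik.mean() - threshold * log_lik.std()
    return df[log_lik > cutoff]

# a tight Gaussian cloud plus one planted, obvious outlier
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x", "y", "z"])
demo.loc[0] = [50.0, 50.0, 50.0]
cleaned_demo = drop_gmm_outliers(demo, threshold=2.0)
print(len(cleaned_demo))  # 199: only the planted outlier is removed
```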

In [59]:
after_anova_clean = after_anova_clean.drop(columns=["grade", "sub_grade"])
In [67]:
p4_remove_outliers_and_correlations = Pipeline(
    [
        (
            "gmm_outlier_removal",
            clean.OutlierRemovalUsingGaussianMixtureModel(
                threshold=0.85, method="stddev", reg_covar=5e-5
            ),
        ),
        ("drop_correlated_cols", clean.DropCorrelatedCols(threshold=0.8)),
        ("passthrough", "passthrough"),
    ]
)


if run_entire_notebook() or True:
    without_outliers = p4_remove_outliers_and_correlations.fit_transform(
        after_anova_clean
    )
else:
    without_outliers = after_anova_clean
skipping optional operation

Ideally, we would like to train this specific part of the pipeline with a lot more data (~70k rows is not a lot to correctly capture the overall shape of a 2M-row dataset), but it's a start.

Just for curiosity, let's check out which columns were dropped due to high correlations (and check that they were the same ones we saw on the heatmap before), and how many rows were removed after GMM marked them as outliers.

In [68]:
p4_remove_outliers_and_correlations.named_steps["gmm_outlier_removal"].outliers_drop_count
Out[68]:
2102
In [69]:
p4_remove_outliers_and_correlations.named_steps["drop_correlated_cols"].dropped_features
Out[69]:
['funded_amnt',
 'funded_amnt_inv',
 'installment',
 'home_ownership_mortgage',
 'fico_range_high',
 'avg_cur_bal',
 'bc_util',
 'num_actv_rev_tl',
 'num_bc_sats',
 'num_op_rev_tl',
 'num_rev_accts',
 'num_rev_tl_bal_gt_0',
 'num_sats',
 'percent_bc_gt_75',
 'pub_rec_bankruptcies',
 'tot_hi_cred_lim',
 'total_bc_limit',
 'total_il_high_credit_limit']
In [70]:
display(after_anova_clean.shape)
display(without_outliers.shape)
(71246, 65)
(69144, 47)

I was initially surprised that installment appeared as a column to drop, but after noticing that all loans have one of only two terms (36 or 60 months), it makes perfect sense. Among all the loans with the same duration, the installment is essentially the total amount (plus interest) divided by 36 or 60.

If there were many more possible loan durations, this would no longer be highly correlated, but since there are only 2... it is practically guaranteed.

So, it makes sense to let these features go. It goes to show how sometimes our intuition gives us the wrong initial expectations ("for sure installments will be critical for prediction"), and then statistics shows up and helps us trust the numbers instead of our opinions, hopes and dreams.

In [71]:
without_outliers.head()
Out[71]:
mths_since_last_delinq mths_since_last_record loan_amnt term int_rate home_ownership_rent annual_inc verification_status purpose_other purpose_debt ... num_bc_tl num_il_tl num_tl_30dpd num_tl_90g_dpd_24m num_tl_op_past_12m pct_tl_nvr_dlq tax_liens total_bal_ex_mort disbursement_method predict_accept
0 22.0 0.0 13200.0 0 11.53 1 50000.0 1 1 0 ... 3.0 21.0 0.0 0.0 1.0 92.1 0.0 17827.0 1 1
1 52.0 0.0 13500.0 0 11.44 0 67500.0 0 0 1 ... 8.0 10.0 0.0 0.0 1.0 95.7 0.0 42953.0 1 0
3 0.0 0.0 15000.0 0 22.45 1 62000.0 1 0 1 ... 6.0 3.0 0.0 0.0 5.0 100.0 0.0 63908.0 2 0
5 12.0 26.0 7000.0 0 11.99 0 75000.0 1 0 1 ... 5.0 14.0 0.0 0.0 4.0 84.6 1.0 36306.0 1 1
6 0.0 0.0 5000.0 0 16.02 0 65000.0 0 0 0 ... 3.0 27.0 0.0 0.0 0.0 100.0 0.0 56111.0 1 0

5 rows × 47 columns

In [72]:
md("**noise removed** (so far)")
print(f"{1-(without_outliers.shape[0] / 100_000):.2%} rows dropped")
print(f"{1-(without_outliers.shape[1] / 150):.2%} columns dropped")

noise removed (so far)

30.86% rows dropped
68.67% columns dropped

And with this, we're down to a more reasonable size of dataset! 🙌

The only data left is data that is both relevant and useful for our future predictions.

But we can do something more:

Feature Engineering¶

Some thoughts on Business Expertise (accept/reject logic)¶

There were 2 ways to approach this project. One was to use the two datasets (accept and reject) and build a model that predicts accept/reject based on whatever columns are common to both datasets.

The number of common columns was so small that this seemed too simple and unlikely to work well.

So instead, I decided to simulate a different scenario:

  • The loans in the platform are all the ones in the "accepted" dataset, and our job as financial planners is to look back, introspect, and build a model that will help us predict loans better.

Ideally, we want to have a model that can predict loans that will be fully paid, and that will reject loans that are unlikely to be fully paid.

This means that we have to create our target column: we will use some features from the dataset to derive it, and then we will delete those columns. Once those are gone, we will train the model on the remaining data and see if it is able to find the underlying patterns in the data.

There is only one minor problem: we are but mere aspiring data scientists... and we're definitely not experts in finance.

In case you hadn't noticed yet, I do not have the slightest clue about financial terminology. In a real scenario, we would check with finance experts... For now, we will define simple criteria for our derived target class ourselves.

So let's pretend everything is going well and our financial experts have agreed with our criteria for accepting/rejecting loans.

Creating some derived features¶

It seems our dataframe is ready for our model!

But before this, we want to create a few additional features that might help the model:

  • The DTI (debt to income) field that we decided to keep during ANOVA cleanup will (hopefully) be useful, but we noticed that the ratio is not actually a direct ratio of any other fields, so we will enrich our dataset with the raw ratio (and we'll call it *_custom so we can differentiate it)
  • We also want to find out if the total_rev_hi_lim (max credit) can be used to enrich other variables.

    total_rev_hi_lim is a variable that represents the total revolving high credit or credit limit for a borrower. It is the maximum amount of credit that a financial institution extends to a borrower through a revolving credit account, such as a credit card or a home equity line of credit.

In [83]:
without_outliers["dti_ratio_custom"] = (
    without_outliers.loan_amnt / without_outliers.annual_inc
)
without_outliers["loan_to_credit_limit"] = (
    without_outliers.loan_amnt / without_outliers.total_rev_hi_lim
)
without_outliers.T.tail()
Out[83]:
0 1 3 5 6 7 8 9 10 11 ... 99983 99984 99987 99988 99990 99993 99994 99995 99998 99999
total_bal_ex_mort 17827.000000 42953.000000 63908.000000 36306.000000 56111.000000 26295.000000 19893.000000 9336.000000 37352.000000 56133.000000 ... 18242.000000 39874.000000 51453.000000 47690.00 15323.000 49447.000000 110387.000000 75868.000000 48390.000000 33425.00
disbursement_method 1.000000 1.000000 2.000000 1.000000 1.000000 1.000000 1.000000 2.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.00 1.000 1.000000 1.000000 1.000000 1.000000 1.00
predict_accept 1.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 0.000000 0.000000 1.000000 ... 1.000000 0.000000 0.000000 0.00 0.000 1.000000 1.000000 0.000000 1.000000 0.00
dti_ratio_custom 0.264000 0.200000 0.241935 0.093333 0.076923 0.096154 0.250000 0.468000 0.200000 0.322581 ... 0.260000 0.154545 0.205941 0.15 0.150 0.200000 0.237209 0.048000 0.121739 0.04
loan_to_credit_limit 1.306931 0.794118 0.479233 1.428571 0.264550 0.113122 0.219298 0.314516 0.192308 0.505051 ... 0.959732 1.322222 0.568306 1.00 0.625 0.400641 0.331169 0.103448 0.327485 0.25

5 rows × 69144 columns

Preparing our data to make predictions¶

Build the cleaning pipeline¶

Next, let's build an über-pipeline: the overarching "cleaning pipeline" composed of all the previous smaller steps, executed sequentially.

This will be used in our next step:

  • We need a cleaning pipeline that we can use for all our data (regardless of the target)
  • We also need a second round of cleaning that very much depends on our target variable (e.g. to detect and remove columns that do not contribute to the target label/result)
In [82]:
p_cleaning_general = Pipeline(
    [
        ("1_drop_noise", p1_clean_drop_noise),
        ("2_encode_categoricals", p2_encode_categoricals),
    ]
)
p_cleaning_general
Out[82]:
Pipeline(steps=[('1_drop_noise',
                 Pipeline(steps=[('fill_zeros',
                                  ColumnTransformer(remainder='passthrough',
                                                    transformers=[('impute_with_0',
                                                                   SimpleImputer(fill_value=0,
                                                                                 strategy='constant'),
                                                                   ['mths_since_last_delinq',
                                                                    'mths_since_last_record'])],
                                                    verbose_feature_names_out=False)),
                                 ('drop_time', DropTimeDependentColumns()),
                                 ('drop_joint', DropJointApplicationData...
                                  EncodeInitialListingStatus()),
                                 ('EncodeVerificationStatus',
                                  EncodeVerificationStatus()),
                                 ('EncodeHomeOwnership', EncodeHomeOwnership()),
                                 ('EncodePaymentPlan', EncodePaymentPlan()),
                                 ('EncodePurpose', EncodePurpose()),
                                 ('EncodeTerm', EncodeTerm()),
                                 ('onehot',
                                  OneHotEncoder(cols=['purpose',
                                                      'home_ownership'],
                                                handle_unknown='ignore',
                                                use_cat_names=True)),
                                 ('p', 'passthrough')]))])

Splitting our data into train/val/test¶

Normally, we could build a pipeline that combines cleaning/prediction steps... However, given the large scale of this project, we will do something else.

⚠️ Given the massive dataset we have, we need to be careful about how we approach our train/val/test split.

Considering that it will be hard to load all of the data at once through jupyter/pandas, we might be tempted to load just a bit of the data. However, we need to make sure that the split is done properly. This includes:

  • We cannot risk splitting the data based on the order in which it appears in the file (it should be randomly sampled)
  • We cannot risk having different splits changing between one day and the next, as this would result in data leakage! (and it would be relatively hard to spot)
  • We cannot risk crashing our notebook/kernel mid workflow due to a massive dataframe being loaded into memory. We want our data to be loaded up, in manageable sizes (that we can control)!
    • This would even allow us to perform online training/incremental training. Fancy!
    • One option is to use numpy.memmap to map an array "in memory" to an actual file on disk... a different approach to achieve the same thing.
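As a side note, the numpy.memmap idea mentioned above maps a file on disk as if it were an in-memory array; a tiny self-contained sketch (the file name and shape are made up for the demo):

```python
import numpy as np

# Write a small array to disk once...
arr = np.arange(12, dtype=np.float32).reshape(3, 4)
arr.tofile("memmap_demo.bin")

# ...then map the file back as an array without loading it all into RAM:
# only the pages that are actually indexed get read from disk.
mm = np.memmap("memmap_demo.bin", dtype=np.float32, mode="r", shape=(3, 4))
value = float(mm[1, 2])  # reads just the relevant part of the file
print(value)
```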

However, in order to do any of this, we need a solution that keeps the state of the data split in a way that persists across system reboots. 🤔 If only there were some type of... you know... persistent storage that we could use.

Fortunately, storage drives have become almost mainstream (over the last century), and I am blessed with a laptop that comes with a fancy SSD that I suspect will work wonders for the problem at hand.

So, this is how we will approach this challenge:

  • We will decide on the % splits to use for this project.
  • Instead of doing the split in the notebook live (in memory, at once), we will do the split as files on disk.
  • They will be stored with clearly identifiable file names.
  • We will ensure that the random split is okay and that it respects our chosen percentages by piping their contents through our cleaning pipeline once.
    • All of our steps in our cleaning pipeline are custom transformers that do nothing on their .fit() method. They only do work during .transform(). So there is no risk of our pipeline accidentally learning anything from the files.
    • We will manually check that the outputs of the cleaned files roughly approximate the initial percentages we wanted. This guarantees that when it comes time to actually use the files, they have a fair chance to do their work (because they have enough data).
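To make the "no-op .fit()" point above concrete, here is a minimal transformer in that style (an illustrative stand-in, not one of the notebook's actual transformers):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DropAllNaColumns(BaseEstimator, TransformerMixin):
    """Stateless transformer: fit() stores nothing, so nothing can leak."""

    def fit(self, X, y=None):
        return self  # all the work happens in transform()

    def transform(self, X):
        # Drop columns that contain no data at all
        return X.dropna(axis="columns", how="all")

df = pd.DataFrame({"loan_amnt": [1000, 2000], "empty_col": [None, None]})
cleaned = DropAllNaColumns().fit_transform(df)
print(list(cleaned.columns))
```

Because fit() returns self without recording anything about X, calling the pipeline on any file cannot teach it anything about that file.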

Once all of this is done, we will simply load each file from disk as needed.

In the interest of saving me some valuable time, instead of doing this every time the notebook is started, we will only do it once, and we will store the cleaned outputs of the files. This ensures we only have to clean our data once in our lifetime (hopefully).

Another important part of avoiding data leakage and bias is that we promise not to look at the test file for any reason. If we were to accidentally look at it and analyse it (even manually), it could bias us, and our model could have data leakage problems because of ideas we got while looking at it. In a production environment, the final test file would NEVER be visible to the data scientists working on the model... but in this project, you will have to simply trust me and take my word for it 🤷. Hopefully, this long explanation provides enough evidence to gain your trust.
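Incidentally, one standard way to get a split that is stable across reruns (different from the shuf-based approach used below, but worth knowing) is to bucket each row by a hash of a stable identifier:

```python
import hashlib

def split_of(row_id: str) -> str:
    """Deterministically assign a row to a split based on its id.

    The same id always lands in the same bucket, on any machine, on any day,
    which removes the "splits changing between one day and the next" risk.
    """
    bucket = int(hashlib.md5(row_id.encode()).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "val"
    return "test"

buckets = [split_of(str(i)) for i in range(10_000)]
counts = {name: buckets.count(name) for name in ("train", "val", "test")}
print(counts)  # roughly 80/10/10
```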

First attempt - using dask¶

Since the dataset is so large, we will use the dask library to read the file in a streaming fashion, load it into memory, pipe it through the cleaning pipeline, and then split it into 3 parts (train/val/test) before storing it forever in the right files.

Dask (and pandas) struggle when they try to do type inference on large datasets, especially if the type they inferred turns out to be too narrow/wrong for data that appears later in the file.

Explicitly specifying the types we know are ambiguous will help dask/pd a lot.
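For example (shown with pandas; dask.dataframe.read_csv accepts the same dtype argument; which LendingClub columns actually needed overrides is not shown in this section, so id/zip_code are just plausible candidates):

```python
import io
import pandas as pd

# Columns whose inferred type would be too narrow: zip codes read as ints
# would silently lose their leading zeros.
DTYPE_OVERRIDES = {"id": "string", "zip_code": "string"}

csv = io.StringIO("id,zip_code,annual_inc\n1,01234,50000\n2,90210,65000\n")
df = pd.read_csv(csv, dtype=DTYPE_OVERRIDES)
print(df["zip_code"].tolist())  # leading zero preserved: ['01234', '90210']
```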

Even with dask... this notebook crashed my laptop a couple of times, and I had to kill the kernel a few more times as it was approaching the system limit.

This one was a close call!

  • Purple is RAM
  • Green is SWAP space

Our attempts to use dask did not succeed, as it got stuck on one of the pipeline steps.

It seems that most operations from dask/scikit-learn are cross-compatible, but dask does not support .take() operations (which scikit-learn tried to perform), and the only option was to go and find out which specific step in our pipeline was causing problems.

We will not be doing this; we are abandoning dask, as there are easier solutions. We are not trying to perform complex distributed computing, we just want to shuffle and split a large file with some guarantees (the guarantee we care about is what some libraries call "exactly-once" row usage).

Goodbye, Dask! Until we meet again, my friend!

Second Attempt - using command line tools (*nix required)¶

Same as the first few operations at the beginning of the notebook. This will require a *nix-compatible system.

We will use command line tools that are optimized and low-level. Most of them process files without caring about format, treating them just as a set of lines or characters.

Luckily for us, we already removed all of the garbage in previous steps, so we will start with files that guarantee 1 row of text = 1 record of data (since we dropped everything else at the beginning of the notebook)

In [74]:
# just a reminder, to avoid accidentally doing this more than once on my PC
perform_split_once_off()
skipping once-off splitting operation
Out[74]:
False
  • Extract first row (header)
  • Skip last 2 lines (metadata)
  • and shuffle all the remaining rows
In [75]:
shuffled_all_data = "dataset/split/shuffled_all_data.csv"
header = "dataset/split/header.csv"
In [76]:
if perform_split_once_off():
    if not path.exists("./dataset/split"):
        os.makedirs("./dataset/split", exist_ok=True)

    # Extract the header and save it to a new file
    !head -n 1 $clean_accepted_raw > $header

    # Exclude the last two lines,
    # Skip the first one (header),
    # Shuffle the remaining lines and
    # Append them to the new file
    !tail -n +2 $clean_accepted_raw | head -n -2 | shuf > $shuffled_all_data
skipping once-off splitting operation

We are now ready to construct 3 new splits from the previous 2 files.

Each file is guaranteed to:

  • contain a header
  • have an 80/10/10 split for train/hyperparam_tuning/test
    • yes, with the enormous amount of data we have, 10% each for hyperparam tuning and for test might be overkill. We could probably use a 90/5/5 split and still have a performant model.
    • see the summary/technical closing notes at the end of the notebook for more insights about this split ratio and how it created nightmares for me. I could have easily fixed the issues using a 90/5/5 split, but then I would have learned substantially less than I did, so I am leaving the suboptimal split in, holding your hand, and dragging you down into the darkness with me. Let's walk this learning journey together! :)
  • guarantee to use each row exactly once (not 0, not 2)
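The "exactly-once" guarantee can be sanity-checked with a few lines of Python; here it is sketched against small in-memory stand-ins rather than the real (large) split files:

```python
def check_exactly_once(all_rows, parts):
    """Verify that every row ends up in exactly one part (not 0, not 2)."""
    combined = [row for part in parts for row in part]
    assert len(combined) == len(all_rows), "row count changed"
    assert sorted(combined) == sorted(all_rows), "rows lost or duplicated"
    return True

rows = [f"row_{i}" for i in range(10)]
train, val, test = rows[:8], rows[8:9], rows[9:]
ok = check_exactly_once(rows, [train, val, test])
print(ok)
```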
In [77]:
train_file = "dataset/split/train.csv"
hyperparam_file = "dataset/split/hyperparam.csv"
test_file = "dataset/split/test.csv"
In [78]:
if perform_split_once_off():
    n_rows = !wc -l < $shuffled_all_data

    n_rows = int(n_rows[0])
    n_rows

    train_proportion = 0.8
    hyperparam_proportion = 0.1

    # Calculate the number of rows for each set
    n_train = int(n_rows * train_proportion)
    n_hyperparam = int(n_rows * hyperparam_proportion)
    n_test = n_rows - n_train - n_hyperparam

    # Extract the header and save it to the train, val, and test files
    !head -n 1 $header > $train_file
    !head -n 1 $header > $hyperparam_file
    !head -n 1 $header > $test_file

    # Add the relevant rows to each file
    !tail -n +2 $shuffled_all_data | head -n $n_train >> $train_file
    !tail -n +$(($n_train + 2)) $shuffled_all_data | head -n $n_hyperparam >> $hyperparam_file
    !tail -n +$(($n_train + $n_hyperparam + 2)) $shuffled_all_data | head -n $n_test >> $test_file
skipping once-off splitting operation

If everything went well (and you re-ran the entire notebook, enabling the feature flags), you should now see the three output files:

In [79]:
if perform_split_once_off():
    for file in [train_file, hyperparam_file, test_file]:
        lines = !wc -l $file
        print(lines[0])
        size = !ls -hl $file
        print(size)
skipping once-off splitting operation

These are the sizes of the generated files (before any cleaning!)

file                            rows     size
dataset/split/train.csv      1756777     1.3G
dataset/split/hyperparam.csv  219598     155M
dataset/split/test.csv        219598     155M

Clean our 3 datasets¶

Now that we have successfully split our initial data into 3 sets, we will clean them all, so we only have to do this once (saving us time in future reruns of the notebook).

Needless to say, I suspect that in a real production/professional environment, the steps would be similar, if only slightly more complex and scaled up.

Before we start, let's disable the flag that allowed us to do EDA. We don't need to keep this field for our end-to-end run; we will extract it from inside the transformer instead, like we do with the other 3.

In [80]:
# p_cleaning_general.set_params(**{"1_drop_noise__extract_predict_accept__drop": False})

Under most circumstances, we don't need to see what's going on, but it's good to have an easy way to enable verbose mode from a single toggle (for those rare cases where the data wrangling pipeline goes weird).

In [84]:
verbose = False
p_cleaning_general.set_params(verbose=verbose)
p_cleaning_general.set_params(**{"1_drop_noise__verbose": verbose})
p_cleaning_general.set_params(**{"2_encode_categoricals__verbose": verbose})
p_cleaning_general.set_params(**{"2_encode_categoricals__onehot__verbose": verbose})

Let's also centralise filename logic so we don't have to change it across the entire codebase
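The dataset_filename helper used below lives in loans_utils and is not shown here, but its shape can be inferred from its output; a simplified reconstruction follows (the notebook's real Sprint enum carries extra metadata such as target_field and num, which this sketch omits):

```python
from enum import Enum

class Sprint(Enum):
    acceptance = 1   # predict accept/reject
    grade = 2
    sub_grade = 3
    int_rate = 4

class Phase(Enum):
    train = "train"
    hyperparam = "hyperparam"
    test = "test"

class Dataset(Enum):
    X = "X"
    y = "y"

def dataset_filename(sprint: Sprint, phase: Phase, dataset: Dataset) -> str:
    """Single source of truth for where each dataset lives on disk."""
    return f"dataset/sprint_{sprint.value}/{phase.value}_{dataset.value}.csv"

print(dataset_filename(Sprint.acceptance, Phase.hyperparam, Dataset.y))
# dataset/sprint_1/hyperparam_y.csv
```

Keeping path construction in one function means a future layout change touches exactly one place.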

In [85]:
%reset -f out
Flushing output cache (19 entries)
In [88]:
loans_utils.dataset_filename(Sprint.acceptance, Phase.hyperparam, Dataset.y)
Out[88]:
'dataset/sprint_1/hyperparam_y.csv'
In [89]:
# Let's create unfitted pipelines, to ensure that we must fit them first with training data only.

p_cleaning_general = clone_pipeline(p_cleaning_general)
p3_remove_anova_cols = clone_pipeline(p3_remove_anova_cols)
p4_remove_outliers_and_correlations = clone_pipeline(p4_remove_outliers_and_correlations)


def clean_split(phase: Phase) -> None:
    """
    Cleans a dataframe as a bulk operation.
    If needed, this could be changed to load/process the file in chunks,
    but it seems this is not necessary right now.
    """
    raw_filename = phase.value.raw_filename
    print(f"loading raw file {raw_filename}")
    df = pd.read_csv(raw_filename)

    if phase is Phase.train:
        print("training dataset - fitting first")
        print("performing general cleaning")
        first_clean_original = p_cleaning_general.fit_transform(df)

        print("performing anova cleaning")
        after_anova = p3_remove_anova_cols.fit_transform(first_clean_original)

        all_target_names = [s.value.target_field for s in Sprint]
        all_target_cols = first_clean_original[all_target_names]
        after_anova.drop(columns=all_target_names, inplace=True)

        print("performing outliers and correlations cleaning")
        after_outliers = p4_remove_outliers_and_correlations.fit_transform(after_anova)

    else:
        print("performing general cleaning")
        first_clean_original = p_cleaning_general.transform(df)

        print("performing anova cleaning")
        after_anova = p3_remove_anova_cols.transform(first_clean_original)

        all_target_names = [s.value.target_field for s in Sprint]
        all_target_cols = first_clean_original[all_target_names]
        after_anova.drop(columns=all_target_names, inplace=True)

        print("performing outliers and correlations cleaning")
        after_outliers = p4_remove_outliers_and_correlations.transform(after_anova)

    print("splitting into sprints")
    for s in Sprint:
        folder = dataset_folder(s)
        if not path.exists(folder):
            os.makedirs(folder, exist_ok=True)

        second_clean = after_outliers.copy()
        target_col = s.value.target_field
        print(
            f"preparing datasets for sprint {s.value.num}, "
            f"with target '{target_col}'",
        )

        X = second_clean
        y = all_target_cols[target_col].loc[X.index]
        if X.shape[0] == y.shape[0]:
            print(f"X and y have same rowcount: {X.shape}")
        else:
            logger.warning(
                f"X and y have different rowcount: {X.shape[0]} vs. {y.shape[0]}"
            )

        X.to_csv(dataset_filename(s, phase, Dataset.X))
        y.to_csv(dataset_filename(s, phase, Dataset.y))
In [90]:
# onehot = p_cleaning_general.named_steps["2_encode_categoricals"].named_steps["onehot"]

if perform_split_once_off():
    for phase in Phase:
        print(f"preparing [{phase.name}] data for all sprints")
        clean_split(phase)
else:
    # Just to record the work done during the actual split
    print(
        """
preparing [train] data for all sprints
loading raw file dataset/split/train.csv
training dataset - fitting first
performing general cleaning
performing anova cleaning
doing anova for predict_accept
doing anova for grade
doing anova for sub_grade
performing outliers and correlations cleaning
splitting into sprints
preparing datasets for sprint 1, with target 'predict_accept'
X and y have same rowcount: 1249108
preparing datasets for sprint 2, with target 'grade'
X and y have same rowcount: 1249108
preparing datasets for sprint 3, with target 'sub_grade'
X and y have same rowcount: 1249108
preparing datasets for sprint 4, with target 'int_rate'
X and y have same rowcount: 1249108

preparing [hyperparam] data for all sprints
loading raw file dataset/split/hyperparam.csv
performing general cleaning
performing anova cleaning
performing outliers and correlations cleaning
splitting into sprints
preparing datasets for sprint 1, with target 'predict_accept'
X and y have same rowcount: 155727
preparing datasets for sprint 2, with target 'grade'
X and y have same rowcount: 155727
preparing datasets for sprint 3, with target 'sub_grade'
X and y have same rowcount: 155727
preparing datasets for sprint 4, with target 'int_rate'
X and y have same rowcount: 155727

preparing [test] data for all sprints
loading raw file dataset/split/test.csv
performing general cleaning
performing anova cleaning
performing outliers and correlations cleaning
splitting into sprints
preparing datasets for sprint 1, with target 'predict_accept'
X and y have same rowcount: 156108
preparing datasets for sprint 2, with target 'grade'
X and y have same rowcount: 156108
preparing datasets for sprint 3, with target 'sub_grade'
X and y have same rowcount: 156108
preparing datasets for sprint 4, with target 'int_rate'
X and y have same rowcount: 156108
"""
    )
skipping once-off splitting operation

In [91]:
if run_entire_notebook():
    !wc -l ./dataset/sprint_*/*
skipping optional operation

This results in the following files being created, with the data for each of the steps in our project.

Each of the datasets is tuned for the target we want to predict (remember that the ANOVA test will drop features that do not contribute meaningfully to predicting the target, so each of the sprints has its own distinct dataset, with a different number of features, etc...)

(py39) edu@flex:~/turing-college/projects/sprint11-loans/dataset$ tree
├── [... other files]
├── sprint_1                     < -----  (predict accept loan)
│   ├── hyperparam_X.csv              < ----- used for hyperparameter tuning (X)
│   ├── hyperparam_y.csv              < ----- used for hyperparameter tuning (y) 
│   ├── test_X.csv                    < ----- used for test (X)
│   ├── test_y.csv                    < ----- used for test (y)
│   ├── train_X.csv                   < ----- used for training (X)
│   └── train_y.csv                   < ----- used for training (y)
├── sprint_2                     < -----  (predict loan grade)
│   ├── hyperparam_X.csv
│   ├── hyperparam_y.csv
│   ├── test_X.csv
│   ├── test_y.csv
│   ├── train_X.csv
│   └── train_y.csv
├── sprint_3                     < -----  (predict loan subgrade)
│   ├── hyperparam_X.csv
│   ├── hyperparam_y.csv
│   ├── test_X.csv
│   ├── test_y.csv
│   ├── train_X.csv
│   └── train_y.csv
└── sprint_4                     < -----  (predict loan interest)
    ├── hyperparam_X.csv
    ├── hyperparam_y.csv
    ├── test_X.csv
    ├── test_y.csv
    ├── train_X.csv
    └── train_y.csv

From now on, we will use these files for all of the model training, validation and testing.

Next steps¶

This is the end of sprint 0.

In this sprint, we have taken a look at the data, we have made multiple passes to clean it iteratively, and we have split it into datasets that are almost ready to be used for prediction.

We are ready to jump into the next step: making our first predictive model.

Jump into the notebook for sprint 1 whenever you are ready.